Skip to content

Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377

Merged
dbernstein merged 5 commits into
mainfrom
bugfix/playtime-report-folder-race-condition
May 29, 2026
Merged

Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377
dbernstein merged 5 commits into
mainfrom
bugfix/playtime-report-folder-race-condition

Conversation

@dbernstein
Copy link
Copy Markdown
Contributor

@dbernstein dbernstein commented May 22, 2026

Description

Three independent layers guard against duplicate Google Drive folder creation in generate_playtime_report:

  1. Redis lock — a RedisLock scoped per (root_folder_id, data_source_name) serialises workers within the same Redis instance (i.e. the same shard). If the lock cannot be acquired within FOLDER_LOCK_ACQUIRE_TIMEOUT (60 s), the worker logs a warning and proceeds without it (graceful degradation — no task stalling).

  2. Random jitter — each task invocation sleeps random.uniform(0, FOLDER_CREATION_JITTER_MAX) (up to 10 s) once before the data-source loop. Independent shards share no Redis instance, so the lock alone cannot coordinate them; a single random offset per run spreads their arrival times across all folder levels without multiplying the delay by the number of data sources.

  3. Pick-oldestget_file passes orderBy=createdTime to files.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every subsequent worker converges on the same canonical (oldest) folder for all levels of the hierarchy and the file upload. Orphaned empty duplicates may exist but no files are misplaced.

Residual risk: The three layers together make the collision window very narrow but do not provide a 100% guarantee. A fully watertight fix would require a coordination mechanism shared across all shards (e.g. storing canonical folder IDs in a shared database). We have chosen to accept this low-stakes residual risk to keep the code simple: the task runs monthly, any orphaned folders are cosmetic (no data loss), and the previous behaviour created duplicates on every concurrent run.

Motivation and Context

Multiple CMs grouped in a shard all write reports to the same Google Drive root. When generate_playtime_report fires at the same scheduled time, all workers check for the folder hierarchy simultaneously. Because Google Drive allows same-named folders and its files.list API is eventually consistent, every worker sees "folder not found" and creates its own copy — producing duplicate same-named directories (e.g. multiple "BiblioBoard" folders). Different CM groups share no Redis instance, so a lock alone is insufficient.

JIRA: PP-4434

How Has This Been Tested?

  • redis_fixture added to test_generate_playtime_reports so the task's Redis call resolves to a real client; FOLDER_CREATION_JITTER_MAX monkeypatched to 0 to keep the test fast.
  • test_generate_playtime_report_folder_lock_contention: pre-acquires the folder lock, patches FOLDER_LOCK_ACQUIRE_TIMEOUT to 0.1 s and jitter to 0, asserts the warning is logged and the file upload still completes (graceful degradation).
  • test_get_file_orders_by_created_time: verifies orderBy=createdTime is present in the Drive API request URI and that the first (oldest) result is returned when multiple exist.
  • All 21 affected tests pass.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@dbernstein dbernstein added the bug Something isn't working label May 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Claude finished @dbernstein's task in 3m 46s —— View job


Summary

The three-layer approach (Redis lock, jitter, pick-oldest) is well-reasoned and the implementation is generally clean. The addition of the except Exception block in 436a210 correctly addresses tdilauro's concern, but the exception itself is silently discarded — the only evidence in the logs is a misleading "within X seconds" warning that implies a timeout when the real cause may be a Redis connectivity failure. That's the one actionable finding below.

Details

src/palace/manager/celery/tasks/playtime_entries.py:266-277

The except Exception block catches Redis errors and sets lock_acquired = False, but swallows the exception without logging it. Control then falls through to the if not lock_acquired: block, which emits a message saying "within {FOLDER_LOCK_ACQUIRE_TIMEOUT} seconds" — accurate for a polling timeout, but misleading when the real cause is a connection refused or a network error. An operator seeing only that message has no way to distinguish the two cases or diagnose the underlying Redis problem.

A cleaner structure logs each case with an appropriate message and keeps the if not lock_acquired: branch solely for the timeout path:

lock_acquired = False
try:
    lock_acquired = folder_lock.acquire_blocking(timeout=FOLDER_LOCK_ACQUIRE_TIMEOUT)
    if not lock_acquired:
        task.log.warning(
            f"Could not acquire folder creation lock for {data_source_name!r} "
            f"within {FOLDER_LOCK_ACQUIRE_TIMEOUT} seconds. Proceeding without "
            "lock — duplicate folders may be created."
        )
except Exception:
    task.log.warning(
        f"Redis error acquiring folder creation lock for {data_source_name!r}. "
        "Proceeding without lock — duplicate folders may be created.",
        exc_info=True,
    )

except Exception:
# Redis may be unavailable (connection error, timeout, etc.).
# Treat this the same as failing to acquire within the timeout:
# log a warning and proceed without the lock rather than
# aborting the entire report for this data source.
lock_acquired = False
if not lock_acquired:
task.log.warning(
f"Could not acquire folder creation lock for {data_source_name!r} "
f"within {FOLDER_LOCK_ACQUIRE_TIMEOUT} seconds. Proceeding without "
"lock — duplicate folders may be created."
)

Minor: src/palace/manager/celery/tasks/playtime_entries.py:160

The comment "Exposed as a module-level constant so tests can set it to 0" couples the implementation docstring to a test detail that can drift. The test file is the right place for this rationale (tdilauro also noted this).

FOLDER_CREATION_JITTER_MAX: int = 10

@dbernstein dbernstein changed the title Serialise Google Drive folder creation with a distributed Redis lock Serialize Google Drive folder creation with a distributed Redis lock May 22, 2026
@dbernstein dbernstein closed this May 22, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.37%. Comparing base (d09e784) to head (436a210).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rc/palace/manager/celery/tasks/playtime_entries.py 89.47% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3377   +/-   ##
=======================================
  Coverage   93.37%   93.37%           
=======================================
  Files         503      503           
  Lines       46109    46126   +17     
  Branches     6321     6323    +2     
=======================================
+ Hits        43054    43070   +16     
- Misses       1979     1981    +2     
+ Partials     1076     1075    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dbernstein
Copy link
Copy Markdown
Contributor Author

dbernstein commented May 22, 2026

Reopening — the Redis lock is still needed for within-group serialization. Will layer pick-oldest and jitter on top to handle the cross-shard case.

@dbernstein dbernstein reopened this May 22, 2026
@dbernstein dbernstein force-pushed the bugfix/playtime-report-folder-race-condition branch from 59955fd to 67ed73a Compare May 22, 2026 16:53
@dbernstein dbernstein changed the title Serialize Google Drive folder creation with a distributed Redis lock Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest May 22, 2026
@dbernstein dbernstein changed the title Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434) May 22, 2026
@dbernstein dbernstein force-pushed the bugfix/playtime-report-folder-race-condition branch 2 times, most recently from 1250a55 to 07ddf48 Compare May 27, 2026 20:57
@dbernstein dbernstein requested a review from a team May 28, 2026 19:54
Copy link
Copy Markdown
Contributor

@tdilauro tdilauro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! 🎸

One significant question and a minor suggestion below, but I'm approving it now, in case my question has a happier answer than I expected.


# How long (seconds) to wait for the folder-creation lock before proceeding without it.
# Typed as float | int to match acquire_blocking's signature; tests monkeypatch this
# to a small float (e.g. 0.1) to avoid waiting a full 60 seconds.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - probably better to put the test comment in the test file, rather than here, since they might diverge.

],
lock_timeout=timedelta(minutes=5),
)
lock_acquired = folder_lock.acquire_blocking(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions that failure to acquire the lock should result in graceful degradation.

Question: Will RedisLock.acquire raise if, for example, Redis is down or not responding for some reason?

If an exception is thrown here, it won't be handled and we won't end up proceeding to the lock-less fallback below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that is a problem: I will fix it. If lock acquisition fails due to an issue with Redis, we should just carry on since the purpose of the lock is to reduce the probability of generating duplicate directories not eliminate it entirely.

dbernstein and others added 5 commits May 29, 2026 11:04
…(PP-XXXX)

Multiple Celery workers running generate_playtime_report concurrently all call
create_nested_folders_if_not_exist at the same time. Google Drive allows
same-named folders and its list-API has indexing latency, so all workers see
"folder not found" and each creates a duplicate. A distributed Redis lock
scoped per (root_folder_id, data_source_name) serialises the check-and-create
step. If a worker cannot acquire the lock within FOLDER_LOCK_ACQUIRE_TIMEOUT
seconds it logs a warning and proceeds without the lock (graceful degradation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aces

Three independent layers now guard against duplicate folder creation:

1. Redis lock (existing) — serialises workers within the same Redis instance
   (i.e. the same CM group), eliminating intra-group races entirely.

2. Random jitter (new) — each worker sleeps random.uniform(0, 30) seconds
   before acquiring the lock, spreading independent CM groups in time and
   reducing the probability that two groups race to create the same folder.

3. Pick-oldest (new) — get_file now passes orderBy=createdTime to files.list,
   so it always returns the oldest matching folder first.  If two groups do
   race and each creates a copy, every worker converges on the same canonical
   (oldest) folder for all subsequent levels and the file upload.  Orphaned
   empty duplicates may exist but no files are misplaced.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e loop

- Change FOLDER_LOCK_ACQUIRE_TIMEOUT type from int to float | int so it
  matches acquire_blocking's signature and tests can monkeypatch it to
  a small float (e.g. 0.1) without mypy complaints.
- Reduce FOLDER_CREATION_JITTER_MAX from 30 to 10 seconds.
- Move the single jitter sleep before the data-source loop so the delay
  is incurred once per task invocation rather than once per data source,
  keeping per-data-source overhead constant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If Redis is unavailable (connection error, timeout, etc.) acquire_blocking
raises rather than returning False, bypassing the fallback path entirely.
Catching the exception and treating it as lock_acquired=False ensures the
task proceeds without the lock — logging a warning — instead of aborting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbernstein dbernstein force-pushed the bugfix/playtime-report-folder-race-condition branch from 5d02f4b to 436a210 Compare May 29, 2026 18:04
@dbernstein dbernstein enabled auto-merge (squash) May 29, 2026 18:04
@dbernstein dbernstein merged commit d8611d4 into main May 29, 2026
20 of 21 checks passed
@dbernstein dbernstein deleted the bugfix/playtime-report-folder-race-condition branch May 29, 2026 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants